The data is extracted from 1994 US census database and was found at the UCI ML repository: https://archive.ics.uci.edu/ml/datasets/Adult
We will try to analyze how different sociodemographical indicators affect the likelihood of a person earning more than 50,000$ a year.
The comments about the chunk are given before the chunk.
First, let’s import the dataset and format it a bit for easier exploration.
Initialize, read file, assign column names.
Change actual NA values to proper NA object, and drop unused levels.
library(ggplot2)
library(dplyr)
library(reshape2)
library(scales)
library(GGally)
library(gridExtra)
library(vcd)
library(psych)
library(heplots)
## Warning: package 'heplots' was built under R version 3.2.3
adult <- read.csv('~/DataAnalyst/Projects/DataAnalystND_Project_4/adult/adult.data', header = F)
names(adult) <- c('age','workclass','fnlwgt','education','education_num',
'marital_status','occupation','relationship','race','sex',
'capital_gain','capital_loss','hours_per_week',
'native_country','income')
levels(adult$income) <- c('low','high')
adult[adult ==' ?'] <- NA
adult <- droplevels(adult)
#This addition to ggplot plots will set alpha of the legend to 100% for better readability.
fix_alpha <- guides(colour = guide_legend(override.aes = list(alpha = 1)))
#This addition to ggplot plots will rotate the x-labels.
rotate_x <- function(angle = 30) theme(axis.text.x = element_text(angle = angle, hjust = 1))
#This will add the mean point, for boxplots.
mean_point <- stat_summary(fun.y=mean, colour="darkred", geom="point", shape=18, size=3,show_guide = FALSE)
#This will connect mean points with a line.
connect_means <- stat_summary(fun.y = mean, geom = 'line', aes(group = 1))
Arrange education levels by the provided ‘education_num’ variable.
Arrange other factors by frequency of high salary(High Salary Ratio, HSR) - from lowest to highest.
Remove unneeded columns.
Group very low frequency workclass levels together.
Remove spaces from factor levels.
#adult_by['variable'] - dataframe showing high salary ratio for each level of the variable.
adult_by <- list()
f1 <- function(x) x[1]
adult_by[['education']] <- adult %>%
group_by(education) %>%
summarize(num = f1(education_num)) %>%
arrange(num)
adult$education <- ordered(adult$education, levels = adult_by[['education']]$education)
adult <- subset(adult, select = -c(fnlwgt,education_num,capital_gain,capital_loss))
adult$workclass <- factor(gsub('Never-worked','No_pay',adult$workclass))
adult$workclass <- factor(gsub('Without-pay','No_pay',adult$workclass))
#remove spaces in levels
for (col in colnames(adult)) {
if (is.factor(adult[[col]])) {
levels(adult[[col]]) <- gsub(' ', '', levels(adult[[col]]))
}
}
for (col in names(adult[,-11])) {
adult_by[[col]]<- adult %>%
group_by_(col) %>%
summarise(hsr = sum(income == 'high')/n(),
n = n()) %>%
arrange(hsr)
#arrange levels of the factor variables by high salary ratio
if (!(col %in% c('age', 'hours_per_week','education'))) {
adult[[col]] <- ordered(adult[[col]], adult_by[[col]][[col]])
}
}
rm(f1, col)
I’ll refer to high salary ratio of a group (number of people from the group having high income, divided by group size) as HSR. HSR is indicating the income level of the group.
We’ve stored HSR’s for variable levels for each variable in the adult_by[[variable]] list.
HSR of the total population is 0.24.
as.double(nrow(adult[adult$income == "high",])) / nrow(adult)
## [1] 0.2408096
Age 75% of the people are under 50, with mean = 38.58 and median = 37. HSR increases from 16 to 50, then declines. For some reason, ages 79 and 83 have very high HSR.
summary(adult$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.00 28.00 37.00 38.58 48.00 90.00
hist_plot <- ggplot(data = adult, aes(x = age)) +
geom_histogram(binwidth = 1)
hsr_plot <- ggplot(data = adult_by[['age']], aes(x = age, y = hsr)) +
geom_line()
grid.arrange(hist_plot, hsr_plot)
Workclass By far most people work in Private sector which also has the lowest HSR.
Self-emp-inc (probably company owners) are paid very well, federal government staff are also paid well.
hist_plot <- ggplot(data = adult, aes(x = workclass)) +
geom_histogram() +
rotate_x()
hsr_plot <- ggplot(data = adult_by[['workclass']], aes(x = workclass, y = hsr)) +
geom_bar(stat = 'identity') +
rotate_x()
grid.arrange(hist_plot, hsr_plot)
Education Most of the people are HS-grads, Some-college or Bachelors. Having high education definitely increases income.
We arranged education by the (natural) education level, so we see that higher educated people earn more.
hist_plot <- ggplot(data = adult, aes(x = education)) +
geom_histogram() +
rotate_x()
hsr_plot <- ggplot(data = adult_by[['education']], aes(x = education, y = hsr)) +
geom_bar(stat = 'identity') +
rotate_x()
grid.arrange(hist_plot, hsr_plot)
Marital status Most of the people are Married-civ-spouse or Never-married. Married-civ-spouse and married-af-spouse have the highest HSR.
About 1/3 of the respondents were never married, and they have the lowest HSR.
hist_plot <- ggplot(data = adult, aes(x = marital_status)) +
geom_histogram() +
rotate_x()
hsr_plot <- ggplot(data = adult_by[['marital_status']], aes(x = marital_status, y = hsr)) +
geom_bar(stat = 'identity') +
rotate_x()
grid.arrange(hist_plot, hsr_plot)
85% of the HSR is due to the the Married-civ-spouse status.
print('Part of the high income people that have Married-civ-spouse: marital status: ')
## [1] "Part of the high income people that have Married-civ-spouse: marital status: "
print(nrow(subset(adult, adult$marital_status == 'Married-civ-spouse' & adult$income == 'high'))/nrow(subset(adult, adult$income == 'high')))
## [1] 0.8534626
Occupation About 25% of the data are Prof-speciality and Exec-managerial - two highest HSR categories.
Priv-house-serv occupation has HSR of only 0.6%.
hist_plot <- ggplot(data = adult, aes(x = occupation)) +
geom_histogram() +
rotate_x()
hsr_plot <- ggplot(data = adult_by[['occupation']], aes(x = occupation, y = hsr)) +
geom_bar(stat = 'identity') +
rotate_x()
grid.arrange(hist_plot, hsr_plot)
Relationship We have very little wives compared to husbands.
Wives and Husbands have very high HSR.
Wives have even higher HSR than Husbands, despite that women have HSR of 11% and men of 30.5%.
hist_plot <- ggplot(data = adult, aes(x = relationship)) +
geom_histogram() +
rotate_x()
hsr_plot <- ggplot(data = adult_by[['relationship']], aes(x = relationship, y = hsr)) +
geom_bar(stat = 'identity') +
rotate_x()
grid.arrange(hist_plot, hsr_plot)
Race Majority of the people are white. Black and Native American have half as high HSR than White and Asian.
hist_plot <- ggplot(data = adult, aes(x = race)) +
geom_histogram() +
rotate_x()
hsr_plot <- ggplot(data = adult_by[['race']], aes(x = race, y = hsr)) +
geom_bar(stat = 'identity') +
rotate_x()
grid.arrange(hist_plot, hsr_plot)
Sex For some reason, there are twice as many men as women in the survey.
Male - 67%, Female - 33%. HSR for men - 0.31, HSR for women - .11. HSR for men is 3 times higher than HSR for women.
hist_plot <- ggplot(data = adult, aes(x = sex)) +
geom_histogram() +
rotate_x()
hsr_plot <- ggplot(data = adult_by[['sex']], aes(x = sex, y = hsr)) +
geom_bar(stat = 'identity') +
rotate_x()
grid.arrange(hist_plot, hsr_plot)
Hours per week Most of the people work 40 hours, so for the histogram let’s use a log scale. Surprisingly, top HSR is at about 60 hpw(probably because the top-paid executives don’t work long hours), and people working 100 hpw have about the same HSR as standard 40-hpw people.
hist_plot <- ggplot(data = adult, aes(x = hours_per_week)) +
geom_histogram(binwidth = 2) +
scale_y_log10(breaks = c(1, 10, 100, 1000, 10000))
hsr_plot <- ggplot(data = adult_by[['hours_per_week']], aes(x = hours_per_week, y = hsr)) +
geom_line() +
geom_smooth()
grid.arrange(hist_plot, hsr_plot)
## Warning: Stacking not well defined when ymin != 0
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
Native country: 90% of the people are US-natives, so let’s use log scale.
Caribbean and Latin-American have the lowest HSR. US-natives are somewhere in the middle, and the top of the list cannot be attributed to some region in particualar.
ggplot(data = adult, aes(x = native_country)) +
geom_histogram() +
scale_y_log10() +
rotate_x(60)
ggplot(data = adult_by[['native_country']], aes(x = native_country, y = hsr)) +
geom_bar(stat = 'identity') +
rotate_x(60)
There are 32561 people of ages 17-90. There are 11 variables in my dataset: age, workclass, education, marital_status, occupation, relationship, race, sex, hours_per_week, native_country and income. The variables age and hours_per_week are integer variables, the other variables are factors. Medians for numerical data or modes(most frequent levels) for factors: age: 37 workclass: Private education: HS-grad > Some-college > Bachelors marital_status: Married-civ-spouse > Never-married occupation: Prof-speciality, Exec-managerial, Craft-repair, Adm-clerical, Sales, Other-service relationship: Husband > Not-in-family race: White sex: Male. hours_per_week: 40 native_country: United-States income: low - 86%, high - 24%
The dependent variable that we want to predict by other variables is income. The main features that I expect to be influencing income are age, sex, education, workclass and occupation.
I expect all the current features to be of interest. I removed features: fnlwgt - constructed variable (by the census takers), meaning of the variable unclear. education_num - duplicating education, took education ordering from it. capital_gain, and capital_loss - present only for small part of the data.
I didn’t create any additional variables.
The NA values were present as ‘?’, this was changed. I’ve made the ‘adult_by’ list, which contains HSR values for each level of each variable. Factor level ordering: I’ve ordered education by education_num variable. As there is no intrinsic ordering for other factors, I’ve ordered the levels by HSR.
As for distributions, we have much more men than women. People of ages 79 and 83 have unusually high HSR. The HSR of wives is a bit higher that for husbands, while HSR for women is much less than for men. As a curious addition, we have 2 male wives and 1 female husband. These are probably errors.
Women are younger than men by about 3 years. Women: median - 35, mean - 36.86 Men: median 38, mean - 39.43
by(adult$age, adult$sex, mean)
## adult$sex: Female
## [1] 36.85823
## --------------------------------------------------------
## adult$sex: Male
## [1] 39.43355
by(adult$age, adult$sex, median)
## adult$sex: Female
## [1] 35
## --------------------------------------------------------
## adult$sex: Male
## [1] 38
ggplot(adult, aes(x = sex, y = age, fill = sex)) +
geom_violin() +
mean_point
High-income people are older than low-income by about 9 years. High income: mean - 44.25, median - 44 Low income: mean - 36.78, median - 34
by(adult$age, adult$income, mean)
## adult$income: low
## [1] 36.78374
## --------------------------------------------------------
## adult$income: high
## [1] 44.24984
by(adult$age, adult$income, median)
## adult$income: low
## [1] 34
## --------------------------------------------------------
## adult$income: high
## [1] 44
ggplot(adult, aes(x = income, y = age, fill = income)) +
geom_violin() +
mean_point
Looks like most of the low income is in low hours per week.
ggplot(data = adult, aes(x = hours_per_week, fill = income)) +
geom_histogram(binwidth = 2, position = 'dodge') +
scale_y_log10(breaks = c(1,10,100,1000,10000))
Let’s see how much people with different education work. There’s a definite curve in the mean hpw (red diamonds). On average, people with only Preschool education work 36.46 hours while people with Doctorate degree work 46.97 hours.
by(adult$hours_per_week, adult$education, mean)
## adult$education: Preschool
## [1] 36.64706
## --------------------------------------------------------
## adult$education: 1st-4th
## [1] 38.25595
## --------------------------------------------------------
## adult$education: 5th-6th
## [1] 38.8979
## --------------------------------------------------------
## adult$education: 7th-8th
## [1] 39.36687
## --------------------------------------------------------
## adult$education: 9th
## [1] 38.04475
## --------------------------------------------------------
## adult$education: 10th
## [1] 37.05252
## --------------------------------------------------------
## adult$education: 11th
## [1] 33.92596
## --------------------------------------------------------
## adult$education: 12th
## [1] 35.7806
## --------------------------------------------------------
## adult$education: HS-grad
## [1] 40.57537
## --------------------------------------------------------
## adult$education: Some-college
## [1] 38.85228
## --------------------------------------------------------
## adult$education: Assoc-voc
## [1] 41.61071
## --------------------------------------------------------
## adult$education: Assoc-acdm
## [1] 40.50422
## --------------------------------------------------------
## adult$education: Bachelors
## [1] 42.61401
## --------------------------------------------------------
## adult$education: Masters
## [1] 43.83633
## --------------------------------------------------------
## adult$education: Prof-school
## [1] 47.42535
## --------------------------------------------------------
## adult$education: Doctorate
## [1] 46.97337
ggplot(adult, aes(x = education, y = hours_per_week, fill = education)) +
geom_boxplot() +
rotate_x() +
mean_point +
connect_means
All the races look pretty similar, only the Asian-Pac-Islander have a little less HS-grads and a little more Bachelors(and other high educations).
This is natural as Asian-Pac-Islander is the highest-HSR race.
ggplot(data = adult, aes(x = education)) +
geom_histogram() +
rotate_x(60) +
facet_wrap(~race, scales = 'free_y', ncol = 1)
HSR for men with high education is very high.
For both genders there is a breaking point at HS-grad (better seen on women). There are much more people with high income who have at least high school degree than people who don’t.
ggplot(data = adult, aes(x = education, fill = income)) +
geom_histogram() +
scale_y_log10() +
rotate_x() +
facet_wrap(~sex, ncol = 1)
ggplot(data = adult, aes(x = education, fill = income)) +
geom_histogram(position = 'dodge') +
scale_y_log10() +
rotate_x() +
facet_wrap(~sex, ncol = 1)
While the median is 40 hpw for both groups (40 hpw is standard), High-income people on average work 6.6 more hpw than low-income.
by(adult$hours_per_week,adult$income, mean)
## adult$income: low
## [1] 38.84021
## --------------------------------------------------------
## adult$income: high
## [1] 45.47303
ggplot(data = adult, aes(x = income, y = hours_per_week, fill = income)) +
geom_boxplot() +
mean_point
High-income people are on average 7.5 years older. Median age difference is 10 years.
by(adult$age,adult$income, mean)
## adult$income: low
## [1] 36.78374
## --------------------------------------------------------
## adult$income: high
## [1] 44.24984
by(adult$age,adult$income, median)
## adult$income: low
## [1] 34
## --------------------------------------------------------
## adult$income: high
## [1] 44
ggplot(data = adult, aes(x = income, y = age, fill = income)) +
geom_boxplot() +
mean_point
Most of the husbands and wives(highest HSR relationships) are married-civ-spouse(highest HSR marital status).
ggplot(adult, aes(x = relationship, fill = marital_status)) +
geom_histogram() +
rotate_x()
ggplot(adult, aes(x = marital_status, fill = relationship)) +
geom_histogram() +
rotate_x()
Most of the high-income is due to ‘Married-civ-spouse’ marital status or ‘Husband’ relationship.
ggplot(adult, aes(x = relationship)) +
geom_histogram() +
facet_wrap(~income) +
rotate_x()
ggplot(adult, aes(x = marital_status)) +
geom_histogram() +
facet_wrap(~income) +
rotate_x()
Younger people are mostly Never-married, middle-aged are Married or Divorced. Older people are mostly Married, Divorced or Widowed.
ggplot(data = adult, aes(x = age, fill = marital_status)) +
geom_histogram(binwidth = 1)
Most of the younger people are Own-child.
Middle-aged and older people are Husbands, Not-in-family or Unmarried.
As we’ve seen, number of husbands is much higher than number of wives.
ggplot(data = adult, aes(x = age, fill = relationship)) +
geom_histogram(binwidth = 1)
The majority of high-income people come from 2 highest-paid occupations.
ggplot(data = adult, aes(x = occupation, fill = income)) +
geom_histogram() +
rotate_x()
ggplot(data = adult, aes(x = occupation, fill = income)) +
geom_histogram() +
facet_wrap(~income, ncol = 1, scales = 'free_y') +
rotate_x()
The people in self-employed workclasses are mostly men.
NA category has the highest women/men ratio.
ggplot(data = adult, aes(x = workclass, fill = sex)) +
geom_bar(position = 'dodge') +
scale_y_log10(breaks = c(10,100,1000,10000)) +
rotate_x()
The majority of the difference in male-female populations is due to most commot ‘White’ race.
Black race has very high woman/man ratio.
ggplot(data = adult, aes(x = race, fill = sex)) +
geom_histogram(position = 'dodge') +
scale_y_log10(breaks = c(10,100,1000,10000))
Let’s see which occupations are dominated by either sex.
Remeber, that the occupations are arranged by HSR.
The highest female ratio is in the lowest-paid occupation(‘Priv-house-serv’).
Other female occupations: Adm-clerical and Other-service.
Male occupation: Handlers-cleaners, Armed-forces, Transport-moving, Craft-repair, Protective-serv.
Pretty much as expected.
#Dataframe that has female_ratio (num_females/num_total) for every occupation.
occupation_by_sex = adult %>%
group_by(occupation) %>%
summarise(female_ratio = sum(sex == 'Female')/n(),
n = n())
ggplot(data = occupation_by_sex, aes(x = occupation, y = female_ratio)) +
geom_bar(stat='identity') +
rotate_x()
The countries are ordered by HSR.
On average, low-income countries have more women (relative size of men/women is controlled by violin size). On all of the income scale there are countries with older men (Vietnam, Taiwan), and countries with older women (Peru, England).
ggplot(data = adult, aes(x = sex, y = age, fill = sex)) +
geom_violin(scale = 'count') +
mean_point +
connect_means +
facet_wrap(~native_country)
## Warning in max(data$n): у 'max' нет не пропущенных аргументов; возвращаю -
## Inf
## geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?
Let’s look at the correlations. The correlation matrix is made as follows: Factor/Factor - Cramer’s V, Factor/Numerical - eta (ANOVA), Numerical/Numerical - Pearson’s r. The map is very approximate, as there are different correlation measures, but nevertheless we can make some conclusions. 1. Strong correlations between featured relationship - martial_status (as expected). They basically duplicate each other in terms of correlation, so we could consider removing one of them. relationship/marital_status - sex (as expected) sex - occupation (we’ve explored it) race - native country (as expected) age - relationship/marital_status hours_per_week correlates with most of the other features (except race/native_country and age) at an average amount. 2. Strong correlation with income income - relationship/marital_status income - education income - occupation 3. Weak correlations Race and native_country don’t correlate with anything besides themselves, and they have lowest correlation with income. We should consider removing both of them. As we know, about 90% of the people are White and US-native, so this feature isn’t really helpful. Education mainly correlates with income, and by a smaller amount with age, occupation, and hours_per_week. We don’t see significant correlation with sex, race/native_country or relationship. Age does not correlate with sex, race or native_country. As we expect these features to be independent, this indicates a rather good quality of the sample. We still don’t see any explanation of the gender distribution.
#Initialize the correlation matrix
cormat <- data.frame(matrix(NA, nrow = ncol(adult), ncol = ncol(adult)),row.names = names(adult))
names(cormat) <- names(adult)
#The 'text' dataframe will help to add stars(*) to indicate the type of correlation measure.
text <- cormat
#Supply correlation matrix with values. Factor/Factor - Cramer's V, Factor/Numerical - AOV eta, Numerical/Numerical - Pearson correlation.
for (v1 in names(adult)) {
for (v2 in names(adult)) {
if (is.factor(adult[[v1]])) {
if (is.factor(adult[[v2]])) {
if (v1 != v2) {
cormat[v1,v2] <- assocstats(xtabs(as.formula(paste('~',v1,'+',v2)), data = adult))$cramer
} else {
cormat[v1,v2] <- 1
}
text[v1,v2] <- paste(as.character(round(cormat[v1,v2],2)),'*', sep = '')
} else {
model.aov <- aov(as.formula(paste(v2,'~',v1)), data = adult)
cormat[v1,v2] <- (etasq(model.aov, partial = FALSE)$eta[1])^0.5
text[v1,v2] <- paste(as.character(round(cormat[v1,v2],2)),'**', sep = '')
}
} else {
if (is.factor(adult[[v2]])) {
model.aov <- aov(as.formula(paste(v1,'~',v2)), data = adult)
cormat[v1,v2] <- (etasq(model.aov, partial = FALSE)$eta[1])^0.5
text[v1,v2] <- paste(as.character(round(cormat[v1,v2],2)),'**', sep = '')
} else {
cormat[v1,v2] <- cor(adult[[v1]], adult[[v2]])
text[v1,v2] <- paste(as.character(round(cormat[v1,v2],2)),'***', sep = '')
}
}
}
}
#Melt the correlation matrix to give values to ggplot.
cormat_melt <- cormat
cormat_melt$variable2 <- names(cormat_melt)
cormat_melt <- melt(cormat_melt, id.vars = 'variable2')
cormat_melt$variable <- ordered(cormat_melt$variable, names(adult))
cormat_melt$variable2 <- ordered(cormat_melt$variable2, names(adult))
#Add the text (with stars).
text$name <- names(text)
cormat_melt$text <- melt(text, id.vars = 'name')$value
#Plot the heatmap
ggplot(cormat_melt, aes(variable, variable2)) +
geom_tile(aes(fill = value)) +
geom_text(aes(fill = cormat_melt$value, label = cormat_melt$text)) +
scale_fill_gradient(low = "white", high = "red") +
rotate_x()
#HSR for every combination of age and sex.
adult_by_age_sex<- adult %>%
group_by(age, sex) %>%
summarise(hsr = sum(income == 'high')/n(),
n = n()) %>%
arrange(age, sex)
ggplot(adult_by_age_sex, aes(x = age, y = hsr, color = sex)) +
geom_line(size = 1) +
geom_smooth() +
ylim(c(0,NA))
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## Warning: Removed 4 rows containing missing values (geom_path).
## Warning: Removed 4 rows containing missing values (geom_path).
There are too many data points, so let’s only look at the regressions.
#HSR for every combination of age and education
adult_by_age_education<- adult %>%
group_by(age, education) %>%
summarise(hsr = sum(income == 'high')/n(),
n = n()) %>%
arrange(age, education)
ggplot(adult_by_age_education, aes(x = age, y = hsr, color = education)) +
geom_smooth(se = FALSE) +
ylim(0,NA)
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## Warning: Removed 13 rows containing missing values (geom_path).
## Warning: Removed 10 rows containing missing values (geom_path).
## Warning: Removed 3 rows containing missing values (geom_path).
## Warning: Removed 10 rows containing missing values (geom_path).
## Warning: Removed 6 rows containing missing values (geom_path).
## Warning: Removed 13 rows containing missing values (geom_path).
## Warning: Removed 4 rows containing missing values (geom_path).
## Warning: Removed 4 rows containing missing values (geom_path).
## Warning: Removed 3 rows containing missing values (geom_path).
## Warning: Removed 2 rows containing missing values (geom_path).
## Warning: Removed 3 rows containing missing values (geom_path).
## Warning: Removed 3 rows containing missing values (geom_path).
As we’ve already seen, people with better education are older. No surprise here.
People who have have average education have very high age difference between high and low income.
ggplot(data = adult, aes(x = education, y = age, fill = income)) +
geom_boxplot()
We see that people of the lower education categories (from preschool to about 9th grade) are older than the subsequent categories.
That is probably because these people are most likely dropouts (and they could drop out a while ago), while people with 10th grade education or higher could be still studying (we have respondents of age 16+ in the survey).
Men with lower education are younger than women with same education, and the opposite is true for higher education.
ggplot(data = adult, aes(x = education, y = age, fill = sex)) +
geom_boxplot()
We see that in Own-child and Husband/Wife relationships women are older, and in other relationships women are younger.
ggplot(data = adult, aes(x = relationship, y = age, fill = sex)) +
geom_boxplot()
Older people and women work less.
ggplot(data = adult, aes(x = age, y = hours_per_week, color = sex)) +
geom_jitter(alpha = 0.2) +
fix_alpha +
geom_smooth(alpha = 1, size = 2, se = FALSE)
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
Let’s break it by sexes.
The distributions of every relationship except Unmarried and Husband/Wife are the same. There are more unmarried women than men, and there are way more husbands than wives.
ggplot(data = adult, aes(x = age, fill = relationship)) +
geom_histogram(binwidth = 1) +
facet_wrap(~sex, ncol = 1)
ggplot(data = adult, aes(x = age)) +
geom_histogram(binwidth = 1) +
facet_wrap(~sex + relationship, ncol = 6)
ggplot(data = adult, aes(x = age, fill = relationship)) +
geom_histogram(binwidth = 1) +
facet_wrap(~sex, ncol = 1) +
xlab('Age, years') +
ylab('Number of Respondents') +
ggtitle('Histogram of Ages, by Sex and relationship')
The distribution of women looks very different to the distribution of men.
There are less women than men and they are younger.
The histograms of all the relationships for men and women, except husband and wife, look the same.
ggplot(data = adult_by[['hours_per_week']], aes(x = hours_per_week, y = hsr)) +
geom_line() +
geom_smooth() +
xlab('Hours per week working') +
ylab('High salary ratio (frequency of high-income people)') +
ggtitle('Relation of income to working hours per week')
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
The lowest HSR is at 0-25 hpw, and after that HSR rises up to about 60 hpw.
But after 60 hpw the average hsr decreases.
We see that people who work about 100 hours a week earn about the same as people who work 40 hours.
ggplot(data = occupation_by_sex, aes(x = occupation, y = female_ratio)) +
geom_bar(stat='identity') +
scale_y_continuous(labels = percent) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab('Percentage of females in occupation') +
xlab('Occupation') +
ggtitle('Occupations by gender')
Here we can see which professions are dominated by either gender.
The occupations are ordered by HSR (leftmost occupation is worst-paid).
The occupations that have more women than others: Priv-House-Serv(lowest-paid occupation), Adm-clerical and Other-service.
Male occupations: Handlers-cleaners, Armed-forces, Transport-moving, Craft-repair, Protective-serv.
The dataset has very interesting information that characterizes the respondents in different sociodemographical ways. Nevertheless, we can say with fair amount of certainty, that the dataset is not representative of the US population, for example by gender and race distributions.
In the dataset, women and black/native americans are paid much less than men and white/asian people. This does not necessarily imply discrimination, but is suspicious.
People originating from different countries differ in income, and most of the low-income countries are in Latin America. Most of the high-income countries are rich and developed, although it is surprising to see Cambodia, Yugoslavia and even Iran on the top of the list. Anyway, 90% of the people are US-natives, so this variable doesn’t give too much information.
The dataset mostly contains categorical variables, having more numeric variables would be interesting.
Mainly, if our dependent variable (income) was numeric, it would open more opportunities for exploration.
It would be also very interesting to have the same data for a different year to examine the trends.